Joint Bayesian estimation of cell dependence and gene associations in spatially resolved transcriptomic data

Recent technologies such as spatial transcriptomics, enable the measurement of gene expressions at the single-cell level along with the spatial locations of these cells in the tissue. Spatial clustering of the cells provides valuable insights into the understanding of the functional organization of the tissue. However, most such clustering methods involve some dimension reduction that leads to a loss of the inherent dependency structure among genes at any spatial location in the tissue. This destroys valuable insights of gene co-expression patterns apart from possibly impacting spatial clustering performance. In spatial transcriptomics, the matrix-variate gene expression data, along with spatial coordinates of the single cells, provides information on both gene expression dependencies and cell spatial dependencies through its row and column covariances. In this work, we propose a joint Bayesian approach to simultaneously estimate these gene and spatial cell correlations. These estimates provide data summaries for downstream analyses. We illustrate our method with simulations and analysis of several real spatial transcriptomic datasets. Our work elucidates gene co-expression networks as well as clear spatial clustering patterns of the cells. Furthermore, our analysis reveals that downstream spatial-differential analysis may aid in the discovery of unknown cell types from known marker genes.

the spatial clustering of the single cells.The existing spatial clustering methods perform dimension-reduction, either prior to clustering or simultaneously and hence, do not have provisions for understanding the genetic association.More concretely, the expression data observed for a set of p genes over a relatively large number n of single cells, constitute a matrix of expression data.The expression data are also accompanied with the n × d spatial co-ordinates of the single cells, where the dimension d = 2 or 3 depends on the profiling method used.The matrix-variate spatial transcriptomic data provide information on both gene expression dependencies and cell spatial dependencies through the row and column covariances or correlations of the matrix-variate data.
Gaussian processes 13 are commonly used to model spatial data, which typically involve the specification of spatial dependence in the form of a covariance matrix/kernel.Existing spatial covariance estimation methods ignore the dependency structure among the rows (genes in our case) of the matrix-variate data and often rely on a parametric assumption on the covariance kernel.The accuracy of covariance estimation may be sensitive to the specification of such kernels.SpatialDE 14 , SPARK 15 , and BOOST-GP 16 adopt Gaussian processes with prespecified parametric kernels to identify spatially varying genes.Moreover, genes are considered one-at-a-time to identify their spatial expression pattern.This possibly ignores interesting spatial expression patterns induced by co-expressing genes.In this paper, we propose a JOint BayeSian (JOBS) approach to simultaneously estimate the row and column covariances for the matrix-variate spatial transcriptomic data without fixing a parametric column covariance kernel or assuming the rows to be independent.Moreover, the proposed approach is computationally efficient for a large number of spatial locations (i.e., cells).
The proposed method (illustrated schematically in Fig. 1) takes as input the spatial gene expression matrix after standard log-normalization and the spatial coordinates of the single cells in the tissue.The JOBS output consists of the joint posterior estimates of both the row and column covariances for the matrix-variate spatial transcriptomic data.These posterior row and column correlation matrices are summaries of gene and cell dependencies, respectively.These outputs may be further processed and used for downstream analyses.For example, the estimated cell correlations (column correlation matrix in our case) may be used for jointly predicting the spatial distribution of a set of genes in the tissue whereas the estimated gene correlation matrix (corresponds to our row correlation matrix) may be used to reveal the gene co-expression patterns.As an illustration, the Figure 2 shows the observed and JOBS predicted spatial topology of the gene "SCGB1D2" in the dorsolateral prefrontal cortex (DLFPC) of the adult human brain 17 .These fitted gene expression data may be considered as a "de-noised" or smooth representation of the raw gene expression data and may be used for further downstream analyses.For example, the smoothed gene expression data can be used for spatial clustering of the cells in the tissue.Existing methods of estimating gene co-expression network require the assumption that cell types (cluster labels) are either known or can be obtained via some existing spatial clustering method 18 .Our joint modeling approach circumvents this requirement by simultaneously providing the outputs for cell-type labelling and genenetwork estimation.Our findings indicate that accurate estimation of the spatial correlation matrix is essential for achieving accurate cell clustering.Furthermore, we observed that misrepresenting gene correlations, such as assuming independence (uncorrelated), significantly impacts the estimation of spatial covariance.Overall, this article underscores the importance of precise spatial covariance estimation and highlights the detrimental effects of misrepresenting gene correlations.Additionally, our findings provide strong evidence supporting the superiority of our joint modeling approach in achieving improved cell clustering.Moreover, we extend our method for cells collected from multiple independent tissue samples through a Bayesian hierarchical model, which allows for the sharing of information across tissue samples even though the cell spatial locations could be different from tissue to tissue.
In this paper, we first performed detailed simulation experiments, comparing the performance of our proposed method with the existing spatial covariance estimation method in Section Simulations.We present an analysis of a real spatial transcriptomic dataset collected from the STARmap platform 5 in Section STARmap data.In these studies, we demonstrated the effectiveness of our joint modeling approach, which incorporates both spatial and genomic level correlations, surpassing existing clustering methods.Additionally, we applied JOBS on two different spatial transcriptomics data obtained from the 10x Genomics Visium platform 17,19 .We discuss our findings and future directions for this work in Section Discussion.Section Discussion.provides a brief overview of our proposed joint Bayesian model for the case of a single-sample spatial transcriptomic data, and its extension to the case where we have multiple independent samples on a common gene set.The detailed description of our methodology, technical details of our hierarchical Bayesian model, and detailed simulation results can be found in the Supplementary.

Simulations
The detailed simulation setup and its corresponding results are provided in the Supplementary Section D. We performed two sets of simulations to evaluate the performance of JOBS and compared it with a spatial covariance estimation approach ignoring the correlations among the genes (rows) 20 , called NPVecchia.We note that the estimated spatial correlation matrix can be used for spatial clustering of the cells in the tissue.Hence, its accurate estimation is essential to achieve precise spatial clustering.It is worthwhile to mention that existing spatial clustering methods that rely on PCA for dimension reduction consider these uncorrelated principal components for spatial clustering.Apart from destroying the inherent dependence between genes, we conjecture that using these uncorrelated principal components can lead to inefficient spatial clustering.In our first set of simulations, we consider the case of a single sample of spatial transcriptomic data.We considered a wide range of simulation settings, with different choices of the true spatial covariance and gene covariance structures of the matrix-variate data.To monitor the accuracy of estimation of the spatial and gene correlation matrices, we compared the KL divergence (in log scale) and the relative Frobenius error.The technical definitions of KL divergence and relative Frobenius errors are provided in the Supplementary.We further, performed independent replications of our simulation experiments and reported the mean and standard deviation of the two metrics over the replications.From these replicated simulations, we found that in situations where the genes are correlated, the accuracy of estimation of both the gene and spatial correlation matrices is significantly higher for JOBS than under the NPVecchia.Thus, considering genes to be uncorrelated impacts the spatial correlation estimation, which in turn might have an effect on spatial clustering.We also note that as the number of spatial locations (single cells) increases, the accuracy of estimation of the gene correlations increases as can be seen from the corresponding decreasing relative estimation error.We refer the reader to Supplementary Section D.1 for the detailed results.
We extended JOBS to the case where there are multiple independent samples of spatial transcriptomic data.Specifically, we have independent samples of spatial transcriptomic data measured on the same set of genes over possible different spatial locations across samples.In our next set of simulations, we looked at the estimation accuracy of the covariance matrices in the presence of multiple independent samples of spatial transcriptomic data.For simplicity, we considered three independent samples of spatial data on the same set of genes (p) over possibly different spatial locations.The detailed simulation setup and results are presented in Supplementary Section D.2.We see that JOBS reports a significantly smaller estimation error of the covariance matrices than that from the NPVecchia.Moreover, the estimation error of the spatial covariance matrices decreases as the number of genes increases, whereas it shows an increasing trend for the competing method.Besides, as the number of spatial locations (single cells in our case) increases, the estimation error of the gene correlations decreases.Moreover, the estimation accuracy is higher than the case of a single sample of spatial transcriptomic data, which highlights the importance of having multiple samples.
Furthermore, we looked at the scalability of JOBS for increasing number of cells and features/genes through multiple independent replications.We note that JOBS scales nearly linearly with the number of cells.Additionally, the simulations show that the runtime is sub-linear with the number of features/genes.The detailed results can be found in the Supplementary Section D. 3.
We note that although normalization is a standard pre-processing step for spatial transcriptomic data, the lognormalized matrix-variate data may be far from our assumed matrix normal distribution underneath our JOBS.We conducted sensitivity analysis for estimation accuracy, when the underlying data distribution is non-normal.
In particular, we generated the data from a matrix-variate t distribution and looked at the efficiency of estimation for both one-sample and multi-sample case.As before, we considered a variety of number of spatial locations and varied the degrees of freedom of the corresponding matrix-variate t distribution.Clearly, from our results in Supplementary Section E, we see that the estimation performance under JOBS is better than that obtained from NPVecchia.It is worthwhile to note that under the mis-specified model, the estimation performance is sub-par in comparison to the case when the underlying data generating model is indeed matrix-variate normal.Also, an increase in the degrees of freedom of the matrix-variate t distribution shows an improved estimation performance, as such an increase in the degrees of freedom makes the data more "normal".Furthermore, even under the mis-specified model, as the number of independent samples increases, the estimation errors of the row correlation matrices decreases, highlighting the importance of multiple samples of spatial transcriptomic data.The spatial correlation matrices also show lower estimation errors in comparison to the single-sample case, highlighting the benefits of our proposed hierarchical Bayesian model.

STARmap data
We considered the STARmap (spatially-resolved transcript amplicon readout mapping) dataset 5 , which consists of data from four independent samples/mice.The experimental mice were dark housed for four days and then either exposed to light or kept in the dark for another one hour before obtaining measurements from the primary visual cortex.The data comprised of the expression of 160 genes with the number of cells varying between from 931 to 1167 for the four different samples.The spatial locations of these single cells in the tissue were also recorded.The STARmap study observed global induction of several known immediate-early genes in the primary visual cortex due to the light exposure as compared to the mice that were not exposed.This biologically interesting observation led us to focus our analysis on the two mice samples that were exposed to light.We refer to these as the "light" samples.
Genes that display spatial expression patterns in spatially resolved transcriptomic studies may help characterize the spatial transcriptomic landscape of complex tissues.Existing methods like BOOST-GP, SpatialDE, SPARK, and SPARK-X 21 can identify the spatial expression patterns of genes, commonly referred to as Spatial Expression (SE) analysis.SE analysis can help choose the genes that show high spatial variations.However, these methods consider one gene at a time to estimate its spatial expression pattern.In many cases, there may exist co-expressing genes that induce interesting spatial distribution patterns.This motivated us to consider JOBS on the spatially varying genes for the two independent "light" samples.In particular, we selected the top 50 spatially variable genes using SPARK-X, implemented in the R package DR.SC 22 for each of the two independent light samples and considered a common set of genes, which led to 33 spatially varying genes.As a standard pre-processing step for spatial transcriptomics data, we removed cells showing extreme expression of genes from each of the light samples.The data were subsequently log-normalized with a scaling factor equal to the median expression of total reads per cell, following the STARmap study protocol.Thus, the final analysis-ready dataset amounts to the log-normalized expression data for 33 spatially varying genes measured on 927 and 847 single-cells respectively for the two light samples.
We ran the proposed JOBS on the processed dataset, which provided us with the posterior estimate of a spatial correlation matrix for each sample and the shared gene correlation matrix across samples.We further postprocessed these outputs to extract important features, particular to both the sample-specific and shared data.In particular, we obtained the smoothed spatial expression patterns, jointly for the 33 selected genes in each sample.The mean correlation and the mean squared error between the smoothed and observed gene expression values across the two samples were found to be 0.802 and 0.597 respectively.This indicated the high accuracy of the estimation of the spatial cell and gene covariance matrices for the STARmap data.Figure 3 shows the smoothed and observed spatial expression patterns for the genes "Egr1" and "Mgp".Clearly, the smoothed expression patterns are highly aligned with the observed spatial distribution.We considered a Gaussian mixture model (GMM) on the smoothed gene expression data to obtain spatial clustering of the cells, choosing the optimal number of clusters using the Bayesian Information Criterion 23 .We compared the clustering results with two other wellknown spatial clustering methods, namely BayesSpace and DR.SC.To objectively assess the clustering accuracy, we used the manually annotated cell types from the original STARmap study.Since excitatory cells formed a rich class of distinctly identified neurons, we focused our comparison on the subset of major cell types (eL2/3, eL4, eL5, and eL6) constituting excitatory cells.Figure 4 shows the clustering plot for one of the two light samples using the three methods.We looked at the Adjusted Rand Index 24 to demonstrate clustering performance of the three competing methods, comparing the estimated cluster labels with the manually annotated cells types of excitatory cells.The corresponding plot with the true labels as obtained from the STARmap platform is shown in Fig. 4d (we looked at the subset of cell types eL2/3, eL4, eL5, and eL6).Furthermore, since BayesSpace requires the specification of the number of clusters, we ran the algorithm for multiple choices of the number of clusters and report the one with highest accuracy.Clearly, the clustering obtained from JOBS outperforms the other two methods in terms of clustering performance, specifically designed for spatial clustering.This highlights the importance of joint modelling of the gene and spatial correlations in spatial transcriptomic data, which possibly further enhances spatial clustering.
The boxplot of the expression of the top ten spatially varying genes (obtained from SPARK-X) across the different clusters estimated from JOBS in Figure 5 shows interesting distributional pattern.The gene "eRNA3" is seen to be almost uniformly distributed across the clusters, with a relatively high expression pattern.This uniform spatial distribution of the gene "Egr1" for the "light" sample is also validated by STARmap platform.Interestingly, the gene "Bgn" is only significantly expressed in cluster 2. "Bgn" encodes a member of the small leucine-rich proteoglycan family of proteins, which plays a role in bone growth, muscle development, and regeneration 25,26 .The STARmap platform validates most of these cells as smooth muscle cells, which constitute of involuntary, non-striated muscle as seen in Figure 4d.This possibly justifies the up-regulation of the gene "Bgn" in the cluster comprising of smooth muscle cells.Our analysis highlights that the joint modeling approach can aid in the identification of relevant marker genes by clusters.Using the posterior estimates of the spatial correlations, we de-correlated the data and, with graphical LASSO 27 , estimated the network among the selected genes, shown in Figure 6.The estimated network shows that the gene "eRNA3" and "Arx" are hub genes, showing association with multiple genes.Enhancers may be regarded as DNA sequences that regulate the gene expression networks.Enhancer RNAs (eRNAs), which are transcribed from enhancers in a tissue-specific manner, constitute an important class of non-coding RNAs with a multitude of functions involving gene expression regulation 28 .The STARmap study considered eRNAs 1 to 5 of the "Fos" gene and identified "eRNA3" as the most notable and consistent activity-regulated gene marker, which is also highlighted in our estimated network with "eRNA3" being a hub gene.Our estimated network captured this coexpression network between the genes "eRNA3" and "Fos".Concurrently, the "Arx" gene provides instructions for producing a protein that regulates the activity of other genes 29,30 .Dickel et al. 31 found that enhancers near "Arx" gene regulate its transcription in the mouse brain tissue.This possibly justifies the co-expression pattern between the genes "Arx" and "eRNA3" and the genes being hub genes.Furthermore, Figure 7 shows the expression of the co-expressed genes across the different clusters.
The posterior estimate of the row correlation matrix was used to visualize the correlations among the spatially varying genes in Figure 8.The plot shows positive correlation between "Arx" and "eRNA3", which is again consistent with findings from existing literature.The plot shows high negative correlations of the gene "Egr1" with "Arx" and "Prok2".Further, the estimated network in Fig. 6 shows that "Arx" is connected with "Prok2" through the gene "Egr1".This possibly justifies the expression pattern of these genes in clusters 2 and 6, wherein up-regulation of "Arx" down-regulates expression of "Egr1", which in turn up-regulates "Prok2".The estimated correlations give support to the estimated network in Fig. 6 and the spatial distributional patterns in Fig. 5, revealing strong correlations among the co-expressed genes.
In addition to the STARmap data, we considered another two spatial transcriptomic datasets obtained from the 10x Genomics Visium platform.In particular, we considered the DLFPC dataset studied by 17 and the human breast cancer dataset considered by 19 .The detailed description of the datasets and our analyses can be found in the Supplementary Section F. The proposed JOBS-based clustering produced a superior or similar spatial clustering compared to DR.SC and BayesSpace.

Discussion
We have introduced a joint Bayesian method for the estimation of covariance matrices for matrix-variate spatial transcriptomic data wherein both the genes (rows) and cells (columns) of the matrix-variate data are correlated by the very design of the study.We have considered the case where we have multiple independent samples of the www.nature.com/scientificreports/spatial transcriptomic data observed over a possibly different set of spatial locations but a common set of genes.We have illustrated the power of our method using both extensive simulations and real data where we made comparison with existing methods.The post-processed outputs from our method when used for spatial clustering shows improved clustering performance over existing methods.As opposed to existing methods, JOBS can be used to understand gene co-expression network as well as joint-differential analysis of these genes by clusters.
There are a few possible future directions for this work.First, it may be possible to consider spatial transcriptomic studies with large number of observed genes.The challenge is to define the joint distribution over the matrix-variate data, which along with the estimation of covariance matrices would allow for automatic selection of spatially varying genes from the entire gene set through some Bayesian variable selection criterion.It may be also possible to incorporate some Bayesian model-based clustering algorithm for the spatial clustering.Currently, we consider a Markov chain Monte Carlo (MCMC) algorithm to estimate the correlation matrices in our model.This possibly restricts the applicability of our method to large scale spatial transcriptomic datasets.However, it may be possible to consider a variational Bayes approach to estimating the JOBS model parameters, which would significantly speed up computational time.

Joint Covariance estimation for single-sample spatial transcriptomic data
In this section, we briefly present the proposed Bayesian methodology to jointly estimate the covariance matrices of a single-sample matrix-variate spatial transcriptomic data.The detailed methodology is presented in the Supplementary Section A.1.Consider an p × n matrix Y of spatial transcriptomic data where p denotes the number of genes and n denotes the number of cells measured at the spatial locations s 1 , . . ., s n , Here y (ℓ) i is the expression of the ℓ th gene in the ith cell at location s i .We model Y as a centered matrix-normal distribution, where and are the row and column covariance matrices.These correspond to the gene and spatial covariance matrices for the spatial transcriptomic data.We focus on problems where the number of spatial locations n is much larger than the number of genes p.To circumvent computational challenges we consider a sparse approximate Cholesky decomposition, where D = diag(d 1 , . . ., d n ) is a diagonal matrix with positive entries d i > 0 , and U is a unit upper triangular matrix, i.e., an upper triangular matrix with diagonals equal to one.(2) Y ∼ MN p,n (0, �, �),

Figure 1 .
Figure 1.Illustration of our joint Bayesian methodology.

Figure 2 .
Figure 2. Observed and predicted spatial distribution of the gene "SCGB1D2" in the DLFPC dataset.

Figure 3 .
Figure 3. Smoothed and observed spatial expression patterns for two genes corresponding to one of the "light" samples.

Figure 4 . 7 Figure 5 .Figure 6 .
Figure 4. Spatial clustering from our proposed JOBS compared with the state of the art method as implemented by DR.SC R package and the BayesSpace for one of the light samples.The colors indicate the estimated clusters.ARI comparing the estimated cluster labels with the manually annotated cells (shown in (d)) is reported at the top of each panel.

7 Figure 7 .
Figure 7. Boxplot of the expression of the co-expressed genes by cluster.

Figure 8 .
Figure 8. Heatmap of correlation between the spatially varying genes.